XFT: Practical Fault Tolerance beyond Crashes

Authors

  • Shengyun Liu
  • Paolo Viotti
  • Christian Cachin
  • Vivien Quéma
  • Marko Vukolic

Abstract

Despite 30+ years of intensive research, the distributed computing community still does not have a practical answer to non-crash faults of the machines that comprise a distributed system. In particular, Byzantine fault tolerance (BFT), which promises to handle such faults, has not lived up to expectations due to its resource and operation overhead with respect to its crash fault-tolerant (CFT) counterparts. This overhead comes from the worst-case assumption about Byzantine faults, in the sense that some coordinated adversarial activity controls the faulty machines and the entire network at will. To practitioners, however, such strong attacks appear irrelevant.

In this paper, we introduce XFT (“cross fault tolerance”), a novel approach to building reliable distributed systems that decouples the fault space across the machine and network fault dimensions, treating machine faults and network asynchrony separately. This is in sharp contrast to the existing CFT and BFT models that discern system faults only along the machine fault dimension. XFT offers much more flexibility than traditional synchronous and asynchronous models that (too strictly) fix the network fault model of interest regardless of the machine faults.

As the showcase for XFT, we present Paxos++: the first state machine replication protocol in the XFT model. Paxos++ tolerates faults beyond crashes in an efficient and practical way, featuring many more nines of reliability than the celebrated crash-tolerant Paxos protocol, without impacting its resource/operation costs while maintaining the same performance (common-case communication complexity among replicas). Surprisingly, Paxos++ sometimes (depending on the system environment) even offers strictly stronger reliability guarantees than state-of-the-art BFT replication protocols.

Similar resources

From Viewstamped Replication to Byzantine Fault Tolerance

The paper provides an historical perspective about two replication protocols, each of which was intended for practical deployment. The first is Viewstamped Replication, which was developed in the 1980’s and allows a group of replicas to continue to provide service in spite of a certain number of crashes among them. The second is an extension of Viewstamped Replication that allows the group to s...

Reliable Broadcast in a Computational Hybrid Model with Byzantine Faults, Crashes, and Recoveries

This paper presents a formal model for asynchronous distributed systems with parties that exhibit Byzantine faults or that crash and subsequently recover. Motivated by practical considerations, it represents an intermediate step between crash-recovery models for distributed computing and proactive security methods for tolerating arbitrary faults. The model is computational and based on complexi...

Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms

Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we present a performance evaluation methodology that can be generalized to analyze many kinds of fault-tolerant alg...

Improved Fault Tolerant Elastic Scheduling Algorithm for Cloud Computing

The paper focuses on fault tolerance, a long-standing problem in cloud computing, by extending the primary-backup model to include cloud features such as virtualization and elasticity. Fault tolerance is challenging in cloud computing because virtual machines, rather than hosts, are the basic computing instances, and virtual machines can migrate to other hosts. The on-demand provisioning of reso...

Measuring Fault Tolerance with the FTAPE Fault Injection Tool

This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The major parts of the tool include a system-wide fault injector, a workload generator, and a workload activity measurement tool. The workload creates high stress conditions on the machine. Using stress-based injection, the fault injector is able to utilize knowl...

Publication date: 2016